{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "library(tidyverse)\n",
    "\n",
    "set_plot_dimensions <- function(width_choice, height_choice) {\n",
    "    options(repr.plot.width=width_choice, repr.plot.height=height_choice)\n",
    "}\n",
    "\n",
    "cbPal <- c(\"#E69F00\", \"#56B4E9\", \"#009E73\", \"#F0E442\", \"#CC79A7\", \"#0072B2\", \"#D55E00\")\n",
    "\n",
    "set_plot_dimensions(5, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You should see \"Attaching packages\" and some ticks by the packages loaded.\n",
    "# The \"Conflicts\" aren't a problem.\n",
    "\n",
    "# Other problems loading the library? Try running this cell.\n",
    "\n",
    "install.packages('tidyverse')\n",
    "\n",
    "library(tidyverse)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 - Comparing distributions over a categorical variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to load the data.\n",
    "\n",
    "data <- read_csv(\"stars.csv\")\n",
    "\n",
    "type_key <- c('Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant','Hypergiant')\n",
    "spectral_classes <- c('O','B','A','F','G','K','M')\n",
    "\n",
    "data$type <- factor(data$type)\n",
    "data$spectral_class <- factor(data$spectral_class, levels=spectral_classes)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your undergraduate project student, Tunde, is looking at the distributions of spectral class (i.e. colour) for various types of star. \n",
    "\n",
    "He says that the bar charts for white dwarves (type 2) and main sequence stars (type 3) seem similar, but you are not so sure. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "data %>% \n",
    "    filter(type %in% c(2,3)) %>% \n",
    "    select(type,spectral_class) %>%\n",
    "    ggplot(aes(x=spectral_class, fill=type)) +\n",
    "        geom_bar(position = position_dodge(preserve = \"single\")) +\n",
    "        scale_fill_manual(values=cbPal[c(3,4)]) +\n",
    "        scale_x_discrete(drop=FALSE)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question: do types 2 and 3 share the same distribution of spectral class?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To test differences in distributions over a *categorical variable* like spectral class, we can use [*Pearson's chi-squared test*](https://en.wikipedia.org/wiki/Chi-squared_test):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chi-squared test for statistical independence\n",
    "\n",
    "#### Theory\n",
    "\n",
    "$H_0$: Probability distribution for the categorical variable (e.g. spectral class) is independent of group (e.g. star type).\n",
    "\n",
    "$H_1$: Probability distribution depends on group.\n",
    "\n",
    "We need to find out whether the differences in observed frequencies between the two groups are small enough to have arisen by chance. We do this by constructing a [*contingency table*](https://en.wikipedia.org/wiki/Contingency_table) showing the observed frequency of each outcome for each of the two groups, and comparing to the *expected frequencies* under $H_0$.\n",
    "\n",
    "\n",
    "The test statistic is\n",
    "\n",
    "$$X^2 = \\sum^k_{i=1}{\\frac{(x_i-m_i)^2}{m_i}}$$,\n",
    "\n",
    "where $k$ is the number of cells in the table and $x_i$ and $m_i$ are the observed and expected frequencies for each cell.\n",
    "\n",
    "Under $H_0$, $X^2$ follows a [$\\chi^2$ distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution), which is parametrised by the number of degrees of freedom. A contingency table of size $a \\times b$ has $(a-1)(b-1)$ degrees of freedom, i.e. the number of independent counts when row and column sums are held fixed.\n",
    "\n",
    "#### Assumptions\n",
    "\n",
    "- Sampling is random\n",
    "- Each sample is independent of the others\n",
    "- Expected frequency for each cell must be sufficiently large\n",
    "\n",
    "The approximation to the $\\chi^2$ distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 20% of the events have expected frequencies below 5. Where there is only 1 degree of freedom (i.e. a 2 $\\times$ 2 table), the approximation is not reliable if expected frequencies fall below 10. \n",
    "\n",
    "\n",
    "#### Application"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will set $\\alpha=0.05$.\n",
    "\n",
    "We begin by making a contingency table for the observed spectral classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OBSERVED\n",
    "\n",
    "obs <-\n",
    "    data %>% \n",
    "    filter(type %in% c(2,3)) %>% \n",
    "    select(type,spectral_class) %>%\n",
    "    table(exclude=list(\"G\",\"M\",0,1,4,5))\n",
    "\n",
    "print(\"OBSERVED:\")\n",
    "\n",
    "print(obs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"row totals:\")\n",
    "tot_rows <- margin.table(obs,1)\n",
    "show(tot_rows)\n",
    "\n",
    "print(\"column totals:\")\n",
    "tot_cols <- margin.table(obs,2)\n",
    "print(tot_cols)\n",
    "\n",
    "print(\"total observations\")\n",
    "tot_obs <- sum(tot_rows)\n",
    "print(tot_obs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the row and column totals, we can calculate the *expected* frequencies under $H_0$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"EXPECTED:\")\n",
    "exp <- as.matrix(tot_rows) %*% t(as.matrix(tot_cols)) / tot_obs\n",
    "exp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The expected values for classes K and O are too small. We can combine columns O, F and K to get all of the expected values over 5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OBSERVED (combining columns)\n",
    "\n",
    "AB <- obs[,c('A','B')]\n",
    "OFK <- margin.table(obs[,c('O','F','K')],1)\n",
    "new_obs <- cbind(AB,OFK)\n",
    "print(\"OBSERVED:\")\n",
    "new_obs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# EXPECTED (combining columns)\n",
    "\n",
    "AB <- exp[,c('A','B')]\n",
    "OFK <- margin.table(exp[,c('O','F','K')],1)\n",
    "new_exp <- cbind(AB,OFK)\n",
    "print(\"EXPECTED:\")\n",
    "new_exp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In R, the most convenient form of the chi-squared test for a contingency table is `chi2.test`, which just takes the table of observations as input and calculates the expected values and degrees of freedom accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chisq.test(new_obs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the chi-squared test, $p<\\alpha$, so there is enough evidence to reject $H_0$:\n",
    "it appears that white dwarves and main sequence stars follow different distributions for spectral class.\n",
    "\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
